Task 3.4.

Introduction

First, some immediately needed packages are loaded. Throughout the document, further packages are added at the point where they are required, to make clear which package is used for what.

library(data.table) #package to import and work with datatables
library(lubridate) #package to transform dates
library(plotly) #package used here for interactive visualisations

Some important settings are made right at the beginning:

#Preparational tasks
unigecol = "#D20D63" #setting the colour that will be used for most visualisations (where applicable) to maintain a uniform impression
setwd("D:/Dokumente/Studium/Master/Université de Genève/Kurse/Creating Value Through Data Mining") #set working directory
sales <- fread("LaptopSales_red.csv") #read the datatable for this task
sales[, TransDate:=as_date(Date, format="%m/%d/%Y %H:%M")] #create a new column with transformed date-format

Analyses

As a very first step of the analysis, a summary of the whole data table is produced.

summary(sales) #basic function to summarise a whole dataset.
##        V1             Date           Configuration   Customer.Postcode 
##  Min.   :     2   Length:148786      Min.   :  1.0   Length:148786     
##  1st Qu.: 74275   Class :character   1st Qu.:192.0   Class :character  
##  Median :148450   Mode  :character   Median :347.0   Mode  :character  
##  Mean   :148742                      Mean   :379.7                     
##  3rd Qu.:223162                      3rd Qu.:576.0                     
##  Max.   :297572                      Max.   :864.0                     
##                                                                        
##  Store.Postcode      Retail.Price   Screen.Size..Inches. Battery.Life..Hours.
##  Length:148786      Min.   :168.0   Min.   :15.00        Min.   :4.000       
##  Class :character   1st Qu.:440.0   1st Qu.:15.00        1st Qu.:4.000       
##  Mode  :character   Median :500.0   Median :15.00        Median :5.000       
##                     Mean   :508.1   Mean   :15.81        Mean   :4.973       
##                     3rd Qu.:575.0   3rd Qu.:17.00        3rd Qu.:6.000       
##                     Max.   :890.0   Max.   :17.00        Max.   :6.000       
##                     NA's   :6656                                             
##     RAM..GB.     Processor.Speeds..GHz. Integrated.Wireless.  HD.Size..GB.  
##  Min.   :1.000   Min.   :1.50           Length:148786        Min.   : 40.0  
##  1st Qu.:1.000   1st Qu.:1.50           Class :character     1st Qu.: 40.0  
##  Median :2.000   Median :2.00           Mode  :character     Median : 80.0  
##  Mean   :2.101   Mean   :1.93                                Mean   :132.2  
##  3rd Qu.:2.000   3rd Qu.:2.40                                3rd Qu.:120.0  
##  Max.   :4.000   Max.   :2.40                                Max.   :300.0  
##                                                                             
##  Bundled.Applications.   customer.X       customer.Y        store.X      
##  Length:148786         Min.   :512253   Min.   :164886   Min.   :517917  
##  Class :character      1st Qu.:529098   1st Qu.:178716   1st Qu.:528924  
##  Mode  :character      Median :530928   Median :181083   Median :529902  
##                        Mean   :530748   Mean   :179890   Mean   :530644  
##                        3rd Qu.:533076   3rd Qu.:182060   3rd Qu.:534057  
##                        Max.   :549065   Max.   :199846   Max.   :541428  
##                                                          NA's   :85      
##     store.Y         TransDate         
##  Min.   :168302   Min.   :2008-01-01  
##  1st Qu.:178440   1st Qu.:2008-07-03  
##  Median :179641   Median :2008-09-03  
##  Mean   :179757   Mean   :2008-08-26  
##  3rd Qu.:181567   3rd Qu.:2008-11-06  
##  Max.   :190628   Max.   :2008-12-30  
##  NA's   :85       NA's   :110


The dataset contains laptop sales data for the year 2008. It is structured in 18 variables and 148,786 observations. Each observation stands for one sold item and includes the configuration of the laptop, the locations of the customer and of the store that sold it, the retail price and the date of purchase.
Rows with NA values are not removed right at the beginning but within the subsequent analyses, so that rows are only ignored if they have missing values that would interfere with that particular analysis. This ensures that as many values as possible remain available for each analysis.
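As a minimal sketch of this per-analysis strategy (using hypothetical toy data, not the actual sales table), rows can be filtered inside each data.table query so that every analysis drops only the rows it has to:

```r
library(data.table)

# Hypothetical toy data with NAs in different columns
dt <- data.table(
  price = c(500, NA, 450, 600),
  date  = as.IDate(c("2008-01-05", "2008-02-10", NA, "2008-03-15"))
)

# A price analysis only drops the row with a missing price ...
mean_price <- dt[!is.na(price), mean(price)]

# ... while a date analysis only drops the row with a missing date,
# so each analysis keeps as many observations as possible
n_dated <- dt[!is.na(date), .N]
```

Removing all incomplete rows up front would leave only two of the four rows here; the per-analysis filters keep three rows in each case.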

a) Price Questions

i)

At what price are laptops actually selling?

To get a first impression of how selling prices are distributed, a single boxplot is used as a first measure.

ggplotly(
  ggplot(sales, aes(y = Retail.Price)) + # refer to the column directly inside aes() instead of sales$Retail.Price
  geom_boxplot(fill = unigecol, color = "black", outlier.colour = unigecol) +
  labs(title="Retail Prices", x="", y = "Price in GBP") +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())
)


This plot shows that the middle 50% of values lie between GBP 440 and GBP 575; the median is exactly GBP 500.
To gain deeper insight into the distribution of prices, different plots can be used. In the following, a combination of a histogram and a density plot is employed. Possible peaks, other anomalies or skewness that a boxplot cannot reveal might become visible.
In addition, skewness will be calculated.

library(ggpubr) # package for "gghistogram" and "ggdensity"
library(cowplot) # package that enables aligning of graphs
library(e1071) # package for skewness

PricesHistogram <- gghistogram( # creating a histogram
  sales, x = "Retail.Price", # sales is the data input, on the x-axis, there shall be the retail price
  add = "mean", # add a vertical line on the mean
  fill = unigecol # the bars shall be filled with the document colour
)
  
PricesDensity <- ggdensity( # creating a density plot
  sales, x = "Retail.Price", # sales is the data input, on the x-axis, there shall be the retail price
  color = "black" # the colour of the density line shall be black
) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)), position = "right")  + # denotes the second y-scale for the density
  theme_half_open(11, rel_small = 1) + # further description of the y-axis
  rremove("x.axis") + # this and the following line: no doubled ticks, no doubled labels 
  rremove("xlab") +
  rremove("x.text") +
  rremove("x.ticks") +
  rremove("legend") +
ggtitle(label="Overview retail price distribution") # the plot title

aligned_plots <- align_plots(PricesHistogram, PricesDensity, align="hv", axis="tblr") # align plots
ggdraw(aligned_plots[[1]]) + draw_plot(aligned_plots[[2]]) # combine plots

skewness(sales$Retail.Price, na.rm = TRUE) # compute skewness
## [1] 0.3054539


According to the skewness calculation (≈ 0.31), the distribution of prices is only mildly right-skewed and thus approximately symmetric, and no severe anomalies could be detected.
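To illustrate the scale on which such a skewness value is judged, a quick self-contained comparison can be run (with simulated data, not the sales prices):

```r
library(e1071)
set.seed(42)

sym_sample  <- rnorm(100000)  # symmetric distribution -> skewness near 0
skew_sample <- rexp(100000)   # right-skewed distribution -> skewness near 2

skewness(sym_sample)   # close to 0, as for the retail prices
skewness(skew_sample)  # clearly positive
```

A value around 0.3 therefore sits much closer to the symmetric case than to a strongly skewed one.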

ii)

Does price change with time?


# Preparation of values
pricesTimes <- data.table("Price" = sales$Retail.Price, "Date" = sales$TransDate, "Month" = month(sales$TransDate)) #create a new datatable with the retail price, date and month of each observation
pricesTimes <- data.table(pricesTimes, "Month.Abb" = month.abb[pricesTimes$Month]) # add the abbreviated month names in a new column
pricesTimes <- pricesTimes[complete.cases(pricesTimes),] # remove rows with missing values
weekdaysabb = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun") # set abbreviations for the weekdays for a later analysis


To cover several timeframes, from coarser to finer resolutions, the next representations show the change of price over time.
The first two graphs give an overview over the whole year 2008 in a monthly resolution.

library(ggplot2) # package for plots
PricesMonths <- sales[, mean(Retail.Price, na.rm = TRUE), by= month(TransDate)] # new datatable including the average price of every month
PricesMonths <- data.table(PricesMonths, "Month.Abb" = "") # add new column for the month abbreviations
PricesMonths$Month.Abb <- month.abb[PricesMonths$month] # setting the month abbreviations to the new column
names(PricesMonths) <- c("Month", "AvgPrice", "Month.Abb") # renaming the columns
PricesMonths <- PricesMonths[complete.cases(PricesMonths), ] # removing rows with missing values
#PricesMonths <- PricesMonths[order(Month)]

ggplotly(
  ggplot(data=PricesMonths, aes(x=reorder(Month.Abb, Month), y=AvgPrice, group = 1)) + # new ggplot; data is the new datatable, with the months on the x-axis in the correct calendar order
    geom_line(color=unigecol) + # the line shall be in the uniform colour
    geom_point(color=unigecol) + # the points shall be in the uniform colour
    labs(title = "Average Prices (Months, Lineplot)", x = "Month", y = "Mean price in GBP") # naming title and axes
)


According to this graph, there is a steep incline in prices from May to July from around GBP 450 to a peak in July of nearly GBP 540. After this peak, prices decrease again.

ggplotly(
  ggplot(pricesTimes, aes(x=reorder(Month.Abb, Month), y=Price)) + # new ggplot with "pricesTimes" as data, months in the correct order on the x-axis and the price on the y-axis
      geom_boxplot(fill = unigecol, color = "black", outlier.colour = unigecol) + # stating colours for the boxplot
      labs(title = "Average Prices (Months, Boxplot)", x = "Month", y = "Price in GBP") # creating labels of the graph
)


The development described above can also be seen in this boxplot, although less dramatically: here, medians and quartiles are shown instead of mean prices, and the y-scale covers a much larger range to accommodate whiskers and outliers. So, in addition to the price increase in the early summer months of 2008, a sudden increase in the variation of the values can be observed. In contrast to the prices, which decrease again as seen in the previous graph, the higher variability and number of outliers persist in the following months.

The following two graphs repeat the two before, but with a weekly resolution, again beginning with the line graph.

PricesWeeks <- sales[, mean(Retail.Price, na.rm = TRUE), by = week(TransDate)] # creating a new datatable with the weekly mean price, removing NAs
names(PricesWeeks) <- c("Week", "AvgPrice") # renaming
PricesWeeks <- PricesWeeks[complete.cases(PricesWeeks), ] # again removing NAs because previously, they were only removed from the retail price but not from the date-column
PricesWeeks <- PricesWeeks[order(Week)] # ordering the data after weeks
ggplotly( #ggplotly for interactivity
  ggplot(data=PricesWeeks, aes(x=Week, y=AvgPrice)) + # new ggplot with "PricesWeeks" as data, the weeks on the x-axis and the mean price on the the y-axis
    geom_line(color=unigecol) + # colour the line
    geom_point(color=unigecol) + # colour the points
    ggtitle("Average Prices (Weeks)") # creating a title
)

As before, a steep increase from May to July can be observed, even though it is denoted in weeks.

ggplotly(
  ggplot(pricesTimes, aes(x = factor(week(Date)), y = Price)) + # as above, but the week number is converted to a factor so that one boxplot is drawn per week (a continuous x would produce a single box)
      geom_boxplot(fill = unigecol, color = "black", outlier.colour = unigecol) + # as above, just as a boxplot
      ggtitle("Average prices weekly resolution") + # naming the title
      xlab("Weeks") + # label x-axis
      coord_cartesian(xlim = c(1, 52)) # setting the limits of the x-axis in both directions
)


This graph offers the same insights as the monthly graphs, just denoted in weeks.
Finally, the last graph is constructed differently and with a slightly different purpose: it shows a boxplot for every day of the week. The rationale for this plot is that shopping behaviour may vary between weekdays and weekends, possibly originating in a different clientèle on different days.

pricesTimes <- pricesTimes[complete.cases(pricesTimes[ , c(1:3)]),] # removing rows with missing values in the first 3 columns
ggplotly(
  ggplot(pricesTimes, aes(x = factor(weekdays(Date), levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), y = Price)) + # new ggplot with pricesTimes as data; the weekdays are converted to a factor with chronological levels (otherwise the axis would sort them alphabetically and mismatch the abbreviations below)
    geom_boxplot(fill = unigecol, color = "black", outlier.colour = unigecol) + # stating the colours of the graph
    scale_x_discrete(labels = weekdaysabb) + # the abbreviations match the chronological level order
    labs(title = "Average prices (weekdays)", x = "Weekdays", y = "Price")
)


The above-mentioned assumption could not be confirmed by the data: no clear difference can be seen between the different days.
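A detail worth noting in this kind of plot: `weekdays()` returns character strings, which a discrete axis sorts alphabetically. A minimal sketch (with hypothetical dates; English day names from `weekdays()` are assumed, as the function is locale-dependent) of forcing chronological order via a factor:

```r
# A week of hypothetical dates; 2024-01-01 is a Monday
dates <- as.Date("2024-01-01") + 0:6
wd    <- weekdays(dates)  # locale-dependent; English names assumed here

sort(unique(wd))  # alphabetical order - misleading on a plot axis

# A factor with explicit levels keeps Monday..Sunday in chronological order
wd_ordered <- factor(wd, levels = c("Monday", "Tuesday", "Wednesday",
                                    "Thursday", "Friday", "Saturday", "Sunday"))
levels(wd_ordered)
```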


iii)

Are prices consistent across retail outlets?

The approach pursued for this question is to compare the price distribution for every store postcode and make differences visible in a standard plot.

zipsPrices <- data.table("Price" = sales$Retail.Price, "Store" = sales$Store.Postcode) # new datatable with postcodes of stores and retail prices for all observations
ggplotly( # ggplotly for interactivity
  ggplot(zipsPrices, aes(x=Store, y=Price)) + # new ggplot with zipsPrices as data, stores on the x-axis and prices on the y-axis
      geom_boxplot(fill = unigecol, color = "black", outlier.colour = unigecol) + # colouring the boxplots
      labs(title = "Average Prices (Stores, Boxplot)", x = "Stores", y = "Price in GBP") + # labelling the graph
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1), legend.position = "none") # rotating the store names by 90 degrees to make them better readable
)


It can be observed that, according to price, there are two kinds of stores: one group has a median price of GBP 460-470, the other of roughly GBP 510. To make this more obvious, the following plot shows nothing but the mean prices of the stores.

zipsMeanPrices <- sales[, mean(Retail.Price, na.rm = TRUE), by = Store.Postcode] # find the mean price for every store
names(zipsMeanPrices) = c("Postcodes", "AvgPrice") # renaming

# Plotting the mean prices for every store, order increasingly
ggplotly(
  ggplot(zipsMeanPrices, aes(x = reorder(Postcodes, AvgPrice), y = AvgPrice)) +
    geom_point(size = 2, shape = 16, color = unigecol) +
    ggtitle("Retail Prices and Zip-Codes") +
    ylab("Retail price in GBP") +
    xlab("Stores") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1), legend.position="none")
)


This graph confirms the observation from above and enables a clear distinction between the 5 stores with lower average prices (both mean and median) and the ones with higher values.
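The visual split into two price groups can also be checked numerically, for example with a simple k-means clustering. A sketch with hypothetical per-store mean prices mirroring the two groups described above (the values are illustrative, not the actual store means):

```r
set.seed(1)

# Hypothetical per-store mean prices: a low-price and a high-price group
avg_prices <- c(462, 465, 468, 470, 471, 505, 508, 510, 512)

km <- kmeans(avg_prices, centers = 2)  # two clusters on the 1-dimensional data
split(avg_prices, km$cluster)          # the two groups separate cleanly
```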

iv)

How does price change with configuration?

As in the previous task, the approach is to find the mean price, here for every configuration. The result is shown in the following plot:

MP_Conf <- sales[,Mean_Price:=mean(Retail.Price, na.rm = TRUE), by = Configuration] # add a column with the mean retail price for each configuration (note that := modifies sales by reference)
ggplotly( # ggplotly for interactivity
  ggplot(MP_Conf) + # new ggplot with MP_Conf as data
    geom_line(aes(x = Configuration , y = Mean_Price), # line graph with Configuration on the x-axis and mean price on the y-axis
              na.rm = TRUE, color = unigecol, size = 0.5) + # removing NAs, setting colour to the uniform colour
    labs(title = "Mean Price and Configuration", y = "Price in GBP") # labelling the graph
)


The graph shows a positive relation between configuration and price: the higher the configuration number, the more expensive the laptop is on average. Nevertheless, there is much variation, so that for a single configuration many different prices were established. In addition, there appear to be 6 bands ("tubes") of configurations, each with a similar shape. The price level of these bands rises slightly from band to band with increasing configuration number. One can also observe that the price at the lower end of a band lies far below the end price of the previous band, while its end price exceeds that of the previous one.
It is possible that each band corresponds to one family of configurations, with many sub-configurations within a family, ranging from basic to high-end products.
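If this family hypothesis holds, one hypothetical way to operationalise it would be to cut the configuration range into 6 equal-width bands. This is a sketch with made-up configuration numbers; the actual family boundaries would have to be read off the plot rather than assumed equal:

```r
# Made-up configuration numbers spanning the observed range 1..864
conf_numbers <- c(5, 120, 300, 450, 700, 860)

# Cut the range into 6 equal-width bands as candidate "families"
bands <- cut(conf_numbers,
             breaks = seq(0, 864, length.out = 7),
             labels = paste0("family_", 1:6))
table(bands)
```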

b) Location Questions

i)

Where are the stores and customers located?

The first visualisation shows the distribution of customers and stores in the British National Grid coordinate system (EPSG:27700). In this step, no map underlies the plot.

store_size <- sales[,sum(Retail.Price, na.rm=TRUE)/1000000, by = c("Store.Postcode", "store.X", "store.Y")] # creating a new datatable including Postcode and X- and Y- variables for the stores position and getting the sum of sales for each store
names(store_size) <- c("Store.Postcode", "store.X", "store.Y", "Sales in Mio") # renaming

ggplot() +
  geom_point(data = sales, aes(x = customer.X, y = customer.Y, color = "Customers")) + # 1st scatterplot (customers), x- and y-coordinates taken from the sales datatable; the points are coloured homogeneously to distinguish customers from stores
  geom_point(data = store_size, aes(x = store.X, y = store.Y, size = `Sales in Mio`, color = "Stores")) + # 2nd scatterplot (stores), with the point size corresponding to the previously calculated sales volume of each store
  labs(title = "Distribution of Customers and Stores", size = "Sales in Mio", col = "Type", x = "x-value", y = "y-value") # labelling the plot


Since no map underlies the plot yet, the X and Y data of stores and customers are transformed into geographic coordinates and then plotted on a real map. Since the same data are used, just represented differently, the results remain the same.
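The core of this transformation, from British National Grid eastings/northings (EPSG:27700) to WGS84 longitude/latitude (EPSG:4326), can be sketched in isolation. Here the sf package is used as a compact alternative to the sp/rgdal workflow below, with a single illustrative point near central London:

```r
library(sf)

# One illustrative British National Grid point (easting, northing), roughly central London
pt_bng <- st_sfc(st_point(c(530000, 180000)), crs = 27700)

# Reproject to longitude/latitude
pt_wgs <- st_transform(pt_bng, crs = 4326)
st_coordinates(pt_wgs)  # approx. lon -0.13, lat 51.50
```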

library(ggmap)    # maps visualizations
library(osmdata)  # open street maps
library(sp)       # spatial data
library(rgdal)    # needed for transformation of data
library(tidyverse)     # needed for data-handling
library(OpenStreetMap) # package need for the map itself
library(geosphere)     # needed for distance-calculation

customers_locations <- na.omit(unique(sales[, c("Store.Postcode", "customer.X", "customer.Y")])) # setting up a datatable where each customer location appears only once, including the store postcode and the X and Y variables
stores_locations <- na.omit(unique(sales[, c("Store.Postcode", "store.X", "store.Y")])) # as above

# Create SpatialPointsDataframe
customers_locations_SP <- SpatialPointsDataFrame( 
  data = data.frame(customers_locations$Store.Postcode),    # data, stores names (postcodes)
  customers_locations[, c("customer.X", "customer.Y")],           # coordinates, x for "Easting", y for "Northing"
  proj4string = CRS( "+init=epsg:27700" ) )  

stores_locations_SP <- SpatialPointsDataFrame( 
  data = data.frame(stores_locations$Store.Postcode),    # data, stores names (postcodes)
  stores_locations[, c("store.X", "store.Y")],           # coordinates, x for "Easting", y for "Northing"
  proj4string = CRS( "+init=epsg:27700" ) )              # proj4string of the coordinates, assign CRS to data

# Transform coordinates
customers_locations_SP_LL <- spTransform(customers_locations_SP, CRS("+init=epsg:4326"))
stores_locations_SP_LL <- spTransform(stores_locations_SP, CRS("+init=epsg:4326"))

# Transform to dataframe
customers_locations_LL <- data.frame(customers_locations_SP_LL)[, c(1:3)]      # keep first 3 columns
customers_locations_LL <- cbind("Type"="Customer", customers_locations_LL) # label customers as those by adding a new column using cbind
colnames(customers_locations_LL) <- c("Type", "Customers.Postcode", "Customers.Longitude","Customers.Latitude") # renaming

stores_locations_LL <- data.frame(stores_locations_SP_LL)[,c(1:3)]      # keep first 3 columns
stores_locations_LL <- cbind("Type"="Store", stores_locations_LL) # as above
colnames(stores_locations_LL) <- c("Type", "Store.Postcode", "Store.Longitude","Store.Latitude")    # rename stores variables

#stores_customers_LL <- rbind(stores_locations_LL, customers_locations_LL)
stores_sales <- sales[, sum(Retail.Price, na.rm = TRUE), by = Store.Postcode] # new datatable, calculating again the sales of the stores
names(stores_sales) <- c("Store.Postcode", "Sales") # renaming
stores_locations_LL <- merge(stores_locations_LL, stores_sales, by= "Store.Postcode") # merge both datatables

london <- get_map(getbb("London"), source = "osm") # create map london, where the stores are located

ggmap(london) +
  geom_point(data=customers_locations_LL,aes(x = Customers.Longitude, y = Customers.Latitude, colour = "Customers")) + # 1. scatterplot for customers
  geom_point(data = stores_locations_LL,aes(x = Store.Longitude, y = Store.Latitude, colour = "Stores", size = Sales/1000000)) + # 2. scatterplot for stores
  labs(title = "Distribution of customers and stores", x = "Longitude", y = "Latitude", size = "Sales in Mio",col = "Type") # labelling the plot


In both plots one can see that the highest density of customers is in the centre; the second plot shows that this centre coincides with the centre of London. Further away from this centre, the density declines. The stores are more dispersed, and the ones with the highest revenue are located in the centre: the larger the distance from the centre, the smaller the revenue.


ii)

Which stores are selling the most?

stores_sales_cumulative <- sales[,sum(Retail.Price, na.rm = TRUE), by = Store.Postcode]
names(stores_sales_cumulative) <- c("Store.Postcode", "Sales")
stores_sales_cumulative <- stores_sales_cumulative[order(-Sales)]
stores_sales_cumulative <- data.table(stores_sales_cumulative, cumsum(stores_sales_cumulative$Sales))
names(stores_sales_cumulative) <- c("Store.Postcode", "Sales", "Sum_of_sales")

ggplotly(
  ggplot(stores_sales_cumulative,aes(x=reorder(Store.Postcode, -Sales))) +
      geom_bar(aes(y = Sales/1000000), fill = unigecol, stat = "identity") +
      geom_line(aes(y = Sum_of_sales/1000000, group = 1)) +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1), legend.position="none") +
      labs(title = "Sales by stores", x = "Stores", y = "Sum of sales in Mio GBP")
)

This Pareto chart shows how the individual stores contribute to overall sales. The maximum value is roughly GBP 12 Mio and the minimum GBP 40,000, so there is a large spread in revenue.
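The Pareto line in such a chart is simply a cumulative share; with hypothetical per-store revenues it can be computed as:

```r
# Hypothetical store revenues in Mio GBP (illustrative, not the actual values)
store_revenue <- c(12, 9, 7, 4, 2, 1, 0.04)

# Cumulative share of total revenue - the Pareto line of the chart above
cum_share <- cumsum(sort(store_revenue, decreasing = TRUE)) / sum(store_revenue)
round(cum_share, 3)
```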
The following table reports the number of sales per store.

num_sales <- sales[, .N, by = Store.Postcode] # counting the number of sales per store
num_sales <- num_sales[order(-N),] # order the data decreasingly by the count
names(num_sales) <- c("Store", "Number of sales") # renaming
DT::datatable(num_sales) # table output

The order of the stores' contributions to the overall sales number stays exactly the same as in the diagram above. One can therefore conclude that sales number and revenue are strongly positively correlated, as was to be expected.
The precise number is the following:

cor(num_sales$`Number of sales`, stores_sales_cumulative$Sales) # correlation between number of sales and revenue (both tables are sorted decreasingly, so the rows refer to the same stores)
## [1] 0.9828617
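Correlating two per-store vectors by row position only works because both tables happen to be sorted identically; a merge on the store key makes the alignment explicit and robust. A sketch with made-up values:

```r
library(data.table)

# Made-up per-store sales counts and revenues, deliberately in different orders
counts  <- data.table(store = c("A", "B", "C"), n_sales = c(10, 30, 20))
revenue <- data.table(store = c("C", "A", "B"), revenue = c(9800, 5100, 15200))

# Merging on the key aligns the rows regardless of sort order
m <- merge(counts, revenue, by = "store")
cor(m$n_sales, m$revenue)  # high positive correlation, as in the real data
```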

iii)

How far would customers travel to buy a laptop?

It is important to understand how far customers travel to the stores, so in this section the direct connection between a customer and a store (the great-circle distance) is calculated.

# new datatable with the location data and some additional data for all observations
distances <- data.table("customer.Postcode" = sales$Customer.Postcode,
                        "Date" = sales$TransDate,
                        "customer.X" = sales$customer.X,
                        "customer.Y" = sales$customer.Y,
                        "store.Postcode" = sales$Store.Postcode,
                        "store.X" = sales$store.X,
                        "store.Y" = sales$store.Y
                        ) 

distances <- distances[complete.cases(distances),] # only complete cases

# splitting of customers and stores
customers_locations <- distances[, c("customer.Postcode", "customer.X", "customer.Y")]
stores_locations <- distances[, c("store.Postcode", "store.X", "store.Y")]

# Create SpatialPointsDataframe
customers_locations_SP <- SpatialPointsDataFrame( 
  data = data.frame(customers_locations$customer.Postcode),    # data as a dataframe, stores names (postcodes)
  customers_locations[, c("customer.X", "customer.Y")],        #  X and Y coordinates for stores, x for "Easting", y for "Northing"
  proj4string = CRS( "+init=epsg:27700" ) )  

stores_locations_SP <- SpatialPointsDataFrame( 
  data = data.frame(stores_locations$store.Postcode),    # data as a dataframe, stores names (postcodes)
  stores_locations[, c("store.X", "store.Y")],           # X and Y coordinates for stores, x for "Easting", y for "Northing"
  proj4string = CRS( "+init=epsg:27700" ) )              # proj4string of the coordinates, assign CRS to data

# Transform coordinates
customers_locations_SP_LL <- spTransform(customers_locations_SP, CRS("+init=epsg:4326"))
stores_locations_SP_LL <- spTransform(stores_locations_SP, CRS("+init=epsg:4326"))

# Transform to dataframe
customers_locations_LL <- data.frame(customers_locations_SP_LL)[,c(1:3)]      # keep first 3 columns
colnames(customers_locations_LL) <- c("Customer.Postcode", "Customer.Longitude", "Customer.Latitude") # rename customer variables

stores_locations_LL <- data.frame(stores_locations_SP_LL)[,c(1:3)]      # keep first 3 columns
colnames(stores_locations_LL) <- c("Store.Postcode", "Store.Longitude", "Store.Latitude")    # rename stores variables

distances <- data.table(customers_locations_LL, stores_locations_LL, "Date" = distances$Date, "Month" = month(distances$Date), "Month.Abb" = month.abb[month(distances$Date)]) # combine the customers' and stores' location datatables and add date, month and month-abbreviation columns (month.abb must be indexed with the month number, not the date itself)

distances$dist <- distHaversine(distances[,c(2,3)], distances[,c(5,6)]) # distHaversine calculates the distance between two coordinate pairs in metres; the result is the great-circle (haversine) distance, i.e. the shortest connection on the sphere
distances$Month.Abb <- month.abb[distances$Month] # adding month abbreviations to the distances datatable
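A quick self-contained sanity check of `distHaversine` (points are passed as longitude/latitude pairs; one degree of latitude along a meridian corresponds to roughly 111.3 km):

```r
library(geosphere)

# Distance between (lon 0, lat 0) and (lon 0, lat 1), i.e. one degree of latitude
d <- distHaversine(c(0, 0), c(0, 1))  # result in metres
d / 1000  # roughly 111.3 km
```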

Now, since the distances for all sales are calculated, a boxplot can give an overview about the distances travelled.

#Plotting all distances in a boxplot
ggplotly(
  ggplot(data = distances, aes(y = dist/1000)) +
    geom_boxplot(fill = unigecol, outlier.colour = unigecol) +
    labs(title = "Travelled distances", x = "", y = "Distance in km") +
    theme(axis.text.x=element_blank(),
          axis.ticks.x=element_blank())
)


One can see that in the central 50% of cases, the distance travelled to buy a laptop was between 2.4 km and 4.3 km.
As in a) i), possible anomalies that the boxplot cannot show might become visible in a histogram/density plot. Reusing the graph type from a) i) leads to the following result.

distances$km <- distances$dist/1000
  DistanceHistogram <- gghistogram( # creating a histogram
    distances, x = "km", # distances is the data input; the x-axis shows the distance in km
    add = "mean", # add a vertical line on the mean
    fill = unigecol # the bars shall be filled with the document colour
  )
  
  DistanceDensity <- ggdensity( # creating a density plot
    distances, x = "km", # distances is the data input; the x-axis shows the distance in km
    color = "black" # the colour of the density line shall be black
  ) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.05)), position = "right")  + # denotes the second y-scale for the density
    theme_half_open(11, rel_small = 1) + # further description of the y-axis
    rremove("x.axis") + # this and the following line: no doubled ticks, no doubled labels 
    rremove("xlab") +
    rremove("x.text") +
    rremove("x.ticks") +
    rremove("legend") +
  labs(title = "Distance distribution") # the plot title
  
  aligned_plots <- align_plots(DistanceHistogram, DistanceDensity, align="hv", axis = "tblr") # align plots
  ggdraw(aligned_plots[[1]]) + draw_plot(aligned_plots[[2]]) # combine plots

skewness(distances$dist)
## [1] 1.73793
median(distances$km)
## [1] 3.369205

The distribution of distances is strongly positively skewed (≈ 1.74). That makes sense because a travelled distance is bounded below by 0, since one cannot travel less than 0 km to a store, while there is basically no upper limit. Nevertheless, a median of about 3.4 km can be observed, which is fairly close to the lower bound of 0.
Note that the density does not become negative; the two y-axes simply have different zero positions.

iv)

Try an alternative way of looking at how far customers traveled. Do this by creating a new column that computes the distance between customer and store.

The data from iii) are reused here, so the computation is not repeated for this task.
An alternative way to think about how far customers would travel to buy a laptop is to look at the data from another angle. It might be useful to examine the travelled distance with respect to the store the customers visited: significant differences and outlying stores might become observable.

ggplotly(
  ggplot(distances, aes(x = Store.Postcode, y = dist/1000)) + # new ggplot with the previously created datatable distances; the x-axis represents the store (postcode), the y-axis the distance in km
    geom_boxplot(fill = unigecol) + # colouring the boxplot
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1), legend.position="none") + #adjusting legend and orientation
    labs(title = "Stores and the distances travelled to them", x = "Stores", y = "Distance in km") # labelling
)


It was not possible to find clearly outlying stores in terms of distance travelled to them, even though some differences are observable. The following statements can be made:
Customers normally do not travel more than 10 km, regardless of which store they go to. Nevertheless, outliers appear up to 20 km. When looking at these numbers, it is important to keep in mind that these distances are direct, straight-line connections, so in most cases the actual distances are underestimated.

c) Revenue Questions

i)

How do the sales volumes in each store relate to Acell’s revenues?

rev <- sales[, sum(Retail.Price, na.rm = TRUE), by = Store.Postcode] # new datatable with the sales for each store
colnames(rev) <- c("Postcode", "Revenue") # renaming
rev <- rev[order(Revenue), ] # ordering by revenue (sales)

ggplot(rev, aes(fill = Postcode, y = Revenue/1000000, x = 1)) + # new ggplot with the previously created datatable; the revenue is converted to millions of GBP for plotting (the deprecated order aesthetic is omitted; stacking follows the factor order)
    geom_bar(position = 'stack', stat = "identity") + # stacking the bars
    labs(title = "Contribution of stores \n to overall revenue", x = "", y = "Revenue in Mio GBP") + # labelling
    theme(axis.text.x = element_blank(),
          axis.ticks.x = element_blank()) # no x-axis labels


This graph shows that three stores generate the largest revenues, namely SW1V 4QQ, SW1P 3AU and SE1 2BN. Unfortunately, the plot is not very clear, so a treemap might solve this problem.

library(treemapify) # package for treemaps
names(rev) <- c("Store.Postcode", "Revenue") # renaming

# plotting the treemap directly from the columns of rev (no helper variables needed)
ggplot(rev, aes(area = Revenue, label = Store.Postcode, fill = Store.Postcode)) + # new ggplot with the previously created datatable; the tile area corresponds to the revenue
  geom_treemap(color = "white") + # white separating lines
  geom_treemap_text(color = "white", grow = TRUE) # white, scaled labels


The result stays the same but is presented more clearly.

ii)

How does the relationship depend on the configuration?

# The same approach as for the treemap of stores and sales is now pursued but for the configuration, so I will not further comment on it.
rev <- sales[, sum(Retail.Price, na.rm = TRUE), by = Configuration] # new datatable with the revenue for each configuration
colnames(rev) <- c("Configuration","Revenue") # renaming
rev <- rev[order(Revenue), ] # reordering

# setting/preparing data for the treemap
group <- rev$Configuration
value <- rev$Revenue
store_contribution <- data.frame(group, value)

#plotting the treemap
ggplot(store_contribution, aes(area = value, fill = group)) + # area is the revenue and group the respective configuration number
  geom_treemap(color = "white") + # the borders between tiles shall be white
  labs(title = "Contribution of configurations to the overall sales", fill = "Configuration") # title and legend label


Unfortunately, the sheer number of 864 configurations makes it difficult to learn much from this graph, but there is basically no representation that conveys 864 values at once.

Nevertheless, the colour contrasts are clearly stronger in the upper-right corner, where the tiles are smallest: the more extreme a configuration number is, the smaller its revenue tends to be. Looking closer, dark colours (low configuration numbers) dominate in that very corner, a brighter band (high configuration numbers) runs through the centre stripe, and medium colours (medium configuration numbers) sit in the lower-left corner. Recall the earlier insight that the higher the configuration number, the more expensive the laptop is on average. Applied to this graph, it means that medium-priced laptops (medium configurations) account for larger revenue than expensive ones (high configurations, most probably top-end models), which in turn exceed the revenue of cheap ones (low configurations, most probably basic models). This makes sense because most people are “average” users, so demand, and therefore revenue, is highest for mid-range laptops. Special configurations at either end do not sell as well as the medium ones.

d) Configuration questions

i) 1)

What are the details of each configuration?

library(dplyr) # for the pipe operator and group_by/summarise
library(DT) # for datatable representation

# creating a new datatable that contains all information about each configuration, data are directly taken from sales
# finally summing up the number of laptops sold for each configuration
conf_comp <- sales %>%
  group_by(Configuration,
           "Screen Size in Inches" = Screen.Size..Inches.,
           "Battery life in h" = Battery.Life..Hours.,
           "RAM in GB" = RAM..GB.,
           "Processor speed in GHz" = Processor.Speeds..GHz.,
           "Integrated Wireless feature" = Integrated.Wireless.,
           "Memory in GB" = HD.Size..GB.,
           "Bundled Applications" = Bundled.Applications.) %>%
  summarise("Number of sold Laptops with same configuration" = n())

DT::datatable(conf_comp) # showing the table in a clean and complete way


When the table is ordered by configuration, it becomes visible that the lower the configuration number, the weaker the specifications get.

i) 2)

How does this relate to price?

Replotting the graph from a) iv) in a slightly changed way shows the relation between configuration and price.

PrCo <- sales[order(Configuration), mean(Retail.Price, na.rm=TRUE), by = Configuration] # new datatable with the mean retail price of each configuration, ordered by configuration
names(PrCo) <- c("Configuration", "Mean_Price_in_GBP") # renaming

# plotting the same graph as above
ggplotly(
  ggplot(PrCo) +
    geom_point(aes(x = Configuration, y = Mean_Price_in_GBP), na.rm = TRUE, color = unigecol) + 
                              labs(title = "Mean Price and Configuration", y = "Mean Price in GBP") +
    geom_smooth(method = lm, aes(Configuration, Mean_Price_in_GBP), se = FALSE, color = "black")
)


The higher the configuration, the higher the price. As already described in a) iv), prices rise in similarly shaped clusters.
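The slope behind that `geom_smooth(method = lm)` line can also be read off numerically with `lm()`. A toy sketch with invented, exactly linear prices (real data would of course be noisy):

```r
# Invented data: mean price rises by exactly 0.5 GBP per configuration step.
toy <- data.frame(Configuration     = 1:10,
                  Mean_Price_in_GBP = 400 + 0.5 * (1:10))
fit <- lm(Mean_Price_in_GBP ~ Configuration, data = toy)
coef(fit)[["Configuration"]]  # 0.5: positive slope, price grows with configuration
```

A positive coefficient on `Configuration` is the numeric counterpart of the upward-sloping regression line in the plot.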

PrCo$Mean_Price_in_GBP <- round(PrCo$Mean_Price_in_GBP, digits = 2) # rounding values to two decimal places
DT::datatable(PrCo) # showing the data


The same result as described above can be drawn from the table as well.

ii)

Do all stores sell all configurations?

There are several approaches to answering this question. The first one presented is based on aggregation over columns and rows.

numTable <- sales[,.N,by=c("Store.Postcode", "Configuration")] # creating a new table by aggregating the number of items per store and configuration
numTable <- numTable[, .N, by = Store.Postcode] # aggregating again, by store
numTable <- numTable[order(N),] # reordering
names(numTable) <- c("Store_Postcode", "Number") # renaming
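On a toy data frame (values invented), this double aggregation boils down to counting the distinct configurations each store sold, which base R can do directly:

```r
# Toy sales records: store "A" sells configurations 1 and 2, store "B" only 3.
toy_sales <- data.frame(store  = c("A", "A", "A", "B"),
                        config = c(1, 1, 2, 3))
# Count distinct configurations per store.
distinct_per_store <- tapply(toy_sales$config, toy_sales$store,
                             function(x) length(unique(x)))
distinct_per_store  # A: 2, B: 1
```

With data.table, the equivalent one-liner would be `toy_sales[, uniqueN(config), by = store]` (the same idea as the two-step aggregation above).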


The second approach uses a loop, in this case a for-loop.

library(janitor) # needed for function tabyl
medAggDat <- data.table(tabyl(sales[,c(3,5)],  Configuration, Store.Postcode)) # creating a table that shows the number of each configuration sold by each store

# preparing tables and values for the loop
loopTable <- data.table("Store_Postcode" = "0", "Number" = integer(16)) # placeholder result table with one row per store (16 stores)
counter = 0
numConfigurations <- nrow(medAggDat)
numStores <- length(medAggDat) # number of columns: 1 configuration column + 16 store columns

for (i in 2:numStores){ # goes through all store columns
  for (j in 1:numConfigurations){ # goes through all rows/configurations
    if (medAggDat[j,..i]<1){ # if a store sold 0 units of a configuration,
      counter = counter + 1 # it is counted here
    }
  }
  loopTable[i-1,1] = colnames(medAggDat)[i] # store postcode of this column
  loopTable[i-1,2] = numConfigurations - counter # number of configurations actually sold
  counter = 0 # reset counter for the next column/store
}
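For comparison, the nested loop is equivalent to a single vectorised call: per store column, count how many configuration rows are nonzero. A sketch on a toy count matrix (numbers invented):

```r
# Rows = configurations, columns = stores; entries are units sold (invented).
counts <- matrix(c(3, 0, 2,   # store 1: configurations 1 and 3 sold
                   0, 0, 5),  # store 2: only configuration 3 sold
                 nrow = 3)
colSums(counts > 0)  # distinct configurations sold per store: 2 and 1
```

Applied to the numeric store columns of a table like `medAggDat`, `colSums(... > 0)` would replace both loops and the counter bookkeeping in one expression.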

compTable <- data.table(merge(loopTable, numTable, by = "Store_Postcode"),"Match"="") # merge resulting tables from both ways
names(compTable) <- c("Store_Postcode", "Num_Agg", "Num_Loop", "Match") # renaming
compTable$Match <- compTable$Num_Agg == compTable$Num_Loop # the result of comparison of both tables' results shall be displayed in the new column
compTable <- compTable[order(Num_Agg),] # reordering
DT::datatable(compTable) # display the data


The displayed numbers are the configurations each store actually sold.
All rows match: both approaches lead to the same result.
Only 4 stores sold all configurations; most of the remaining stores range between 648 and 863. A single store (S1P 3AU) sold far fewer, only 80 configurations.